2,337 research outputs found

Gaussian Quadrature for Kernel Features

Authors: Dao Tri; De Sa Christopher; Ré Christopher
Publication date: 31/01/2018

Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that $O(\epsilon^{-2})$ samples are required to achieve an approximation error of at most $\epsilon$. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features. Comment: Neural Information Processing Systems (NIPS) 201

arXiv.org e-Print Archive
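
A minimal sketch of the quadrature idea described in the abstract above, not the authors' construction: the one-dimensional Gaussian kernel k(x, y) = exp(-(x - y)^2 / 2) equals the expectation of cos(w (x - y)) over w ~ N(0, 1), so a Gauss-Hermite rule can stand in for random sampling of w. The example assumes the standard three-point rule for N(0, 1) (nodes 0 and ±sqrt(3), weights 2/3 and 1/6); all function names are hypothetical.

```python
import math

# Three-point Gauss-Hermite rule for the standard normal distribution.
# It integrates polynomials in w exactly up to degree 5.
NODES = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
WEIGHTS = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]


def quadrature_features(x):
    """Deterministic feature map z(x) with z(x) . z(y) ~= exp(-(x - y)^2 / 2)."""
    feats = []
    for w, a in zip(NODES, WEIGHTS):
        s = math.sqrt(a)
        # cos(wx)cos(wy) + sin(wx)sin(wy) = cos(w (x - y))
        feats.append(s * math.cos(w * x))
        feats.append(s * math.sin(w * x))
    return feats


def approx_kernel(x, y):
    """Inner product of the deterministic features of x and y."""
    zx, zy = quadrature_features(x), quadrature_features(y)
    return sum(p * q for p, q in zip(zx, zy))


def exact_kernel(x, y):
    """Gaussian kernel with unit bandwidth (an assumption of this sketch)."""
    return math.exp(-0.5 * (x - y) ** 2)


if __name__ == "__main__":
    x, y = 0.2, -0.3
    print(approx_kernel(x, y), exact_kernel(x, y))
```

With only three nodes the approximation is accurate for small |x - y| and degrades as the points move apart; rules with more nodes extend the usable range.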

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling

Authors: De Sa Christopher; Olukotun Kunle; Ré Christopher
Publication date: 16/06/2016

Gibbs sampling is a Markov chain Monte Carlo technique commonly used for estimating marginal distributions. To speed up Gibbs sampling, there has recently been interest in parallelizing it by executing asynchronously. While empirical results suggest that many models can be efficiently sampled asynchronously, traditional Markov chain analysis does not apply to the asynchronous case, and thus asynchronous Gibbs sampling is poorly understood. In this paper, we derive a better understanding of the two main challenges of asynchronous Gibbs: bias and mixing time. We show experimentally that our theoretical results match practical outcomes.

arXiv.org e-Print Archive
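
For background, here is a minimal sketch of ordinary sequential Gibbs sampling, not the asynchronous variant the paper analyzes. It assumes a toy model of two spins x1, x2 in {-1, +1} with p(x1, x2) proportional to exp(w * x1 * x2); the model and the function name are hypothetical choices for illustration.

```python
import math
import random


def gibbs_agreement(w, num_sweeps, seed=0):
    """Estimate P(x1 == x2) under p(x1, x2) proportional to exp(w * x1 * x2)."""
    rng = random.Random(seed)
    x = [1, 1]
    agree = 0
    for _ in range(num_sweeps):
        for i in (0, 1):
            other = x[1 - i]
            # Conditional P(x_i = +1 | other) = sigmoid(2 * w * other).
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * w * other))
            x[i] = 1 if rng.random() < p_plus else -1
        agree += int(x[0] == x[1])
    return agree / num_sweeps


if __name__ == "__main__":
    # The exact value for this model is sigmoid(2 * w).
    print(gibbs_agreement(0.5, 20000), 1.0 / (1.0 + math.exp(-1.0)))
```

In an asynchronous implementation the two updates would run concurrently and could read stale values of the other variable, which is the source of the bias and mixing-time questions raised in the abstract.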

Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width

Authors: De Sa Christopher; Olukotun Kunle; Ré Christopher; Zhang Ce
Publication date: 02/10/2015

Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree-structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds that of human volunteers.

arXiv.org e-Print Archive

Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms

Authors: De Sa Christopher; Olukotun Kunle; Ré Christopher; Zhang Ce
Publication date: 02/10/2015

Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

arXiv.org e-Print Archive

Parallel SGD: When does averaging help?

Authors: De Sa Christopher; Mitliagkas Ioannis; Ré Christopher; Zhang Jian
Publication date: 23/06/2016

Consider a number of workers running SGD independently on the same pool of data and averaging the models every once in a while, a common but not well understood practice. We study model averaging as a variance-reducing mechanism and describe two ways in which the frequency of averaging affects convergence. For convex objectives, we show the benefit of frequent averaging depends on the gradient variance envelope. For non-convex objectives, we illustrate that this benefit depends on the presence of multiple globally optimal points. We complement our findings with multicore experiments on both synthetic and real data.

arXiv.org e-Print Archive
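
The setup in the abstract above (independent SGD workers whose models are averaged periodically) can be sketched as follows. This is a toy simulation under stated assumptions, not the authors' experiments: the objective f(w) = 0.5 * (w - target)^2, the unit-variance Gaussian gradient noise, and the function name are all hypothetical.

```python
import random


def parallel_sgd(num_workers, steps, avg_every, lr=0.1, target=3.0, seed=0):
    """Workers run noisy SGD on f(w) = 0.5 * (w - target)^2, averaging periodically."""
    rng = random.Random(seed)
    models = [0.0] * num_workers
    for t in range(1, steps + 1):
        for k in range(num_workers):
            # Exact gradient plus additive Gaussian noise.
            grad = (models[k] - target) + rng.gauss(0.0, 1.0)
            models[k] -= lr * grad
        if t % avg_every == 0:
            mean = sum(models) / num_workers
            # Every worker restarts from the averaged model.
            models = [mean] * num_workers
    return sum(models) / num_workers


if __name__ == "__main__":
    print(parallel_sgd(num_workers=4, steps=200, avg_every=10))
```

Because this toy objective is quadratic with additive noise, the updates are linear and the final average does not depend on the averaging period (up to floating-point rounding), so the sketch only illustrates the mechanics rather than the regimes in which averaging frequency matters.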

Data Programming: Creating Large Training Sets, Quickly

Authors: De Sa Christopher; Ratner Alexander; Ré Christopher; Selsam Daniel; Wu Sen
Publication date: 08/01/2017

Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.

arXiv.org e-Print Archive
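
To make the notion of labeling functions in the abstract above concrete, here is a small sketch. The three heuristics and the spouse-relation task are hypothetical, and the combination step uses a plain majority vote as a stand-in; the paper instead fits a generative model to denoise the votes.

```python
# Each labeling function returns +1 (positive), -1 (negative), or 0 (abstain).

def lf_married(sentence):
    return 1 if "married" in sentence.lower() else 0


def lf_spouse_word(sentence):
    s = sentence.lower()
    return 1 if ("wife" in s or "husband" in s) else 0


def lf_sibling(sentence):
    s = sentence.lower()
    return -1 if ("brother" in s or "sister" in s) else 0


LABELING_FUNCTIONS = [lf_married, lf_spouse_word, lf_sibling]


def majority_vote(sentence, lfs=LABELING_FUNCTIONS):
    """Combine noisy, possibly conflicting votes; 0 means no label is assigned."""
    total = sum(lf(sentence) for lf in lfs)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0


if __name__ == "__main__":
    for text in ["Ann married Bob in 2010.",
                 "Ann and her brother Bob.",
                 "Ann met Bob."]:
        print(text, majority_vote(text))
```

Majority voting treats every labeling function as equally reliable, which is exactly the limitation a learned model of labeling-function accuracies is meant to address.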

Accelerated Stochastic Power Iteration

Authors: De Sa Christopher; He Bryan; Mitliagkas Ioannis; Ré Christopher; Xu Peng
Publication date: 09/07/2017

Principal component analysis (PCA) is one of the most powerful tools in machine learning. The simplest method for PCA, the power iteration, requires $\mathcal O(1/\Delta)$ full-data passes to recover the principal component of a matrix with eigen-gap $\Delta$. Lanczos, a significantly more complex method, achieves an accelerated rate of $\mathcal O(1/\sqrt{\Delta})$ passes. Modern applications, however, motivate methods that only ingest a subset of available data, known as the stochastic setting. In the online stochastic setting, simple algorithms like Oja's iteration achieve the optimal sample complexity $\mathcal O(\sigma^2/\Delta^2)$. Unfortunately, they are fully sequential, and also require $\mathcal O(\sigma^2/\Delta^2)$ iterations, far from the $\mathcal O(1/\sqrt{\Delta})$ rate of Lanczos. We propose a simple variant of the power iteration with an added momentum term that achieves both the optimal sample and iteration complexity. In the full-pass setting, standard analysis shows that momentum achieves the accelerated rate, $\mathcal O(1/\sqrt{\Delta})$. We demonstrate empirically that naively applying momentum to a stochastic method does not result in acceleration. We perform a novel, tight variance analysis that reveals the "breaking-point variance" beyond which this acceleration does not occur. By combining this insight with modern variance reduction techniques, we construct stochastic PCA algorithms, for the online and offline setting, that achieve an accelerated iteration complexity $\mathcal O(1/\sqrt{\Delta})$. Due to the embarrassingly parallel nature of our methods, this acceleration translates directly to wall-clock time if deployed in a parallel environment. Our approach is very general, and applies to many non-convex optimization problems that can now be accelerated using the same technique. Comment: 37 pages, 5 figures

arXiv.org e-Print Archive
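
A minimal full-pass sketch of a power iteration with a momentum term, as mentioned in the abstract above. The update w_next = A w - beta * w_prev, the zero initialization of w_prev, and the choice beta = lambda_2^2 / 4 (with lambda_2 the second-largest eigenvalue) are assumptions of this sketch rather than details stated in the abstract; the function names are hypothetical.

```python
import math


def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def momentum_power_iteration(A, beta, iters, w0):
    """Iterate w_next = A w - beta * w_prev, rescaling both iterates together."""
    w_prev = [0.0] * len(w0)
    w = list(w0)
    for _ in range(iters):
        Aw = matvec(A, w)
        w_next = [aw - beta * p for aw, p in zip(Aw, w_prev)]
        scale = math.sqrt(sum(x * x for x in w_next))
        # Dividing w and w_next by the same factor keeps the recurrence
        # intact up to a global scalar while preventing overflow.
        w_prev = [x / scale for x in w]
        w = [x / scale for x in w_next]
    return w


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 3 and 1
    beta = 1.0 ** 2 / 4.0          # lambda_2^2 / 4 for this matrix
    print(momentum_power_iteration(A, beta, 40, [1.0, 0.0]))
```

For this matrix the iterate approaches the top eigenvector (1, 1) / sqrt(2); setting beta to 0 recovers the plain power iteration.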

Incremental Knowledge Base Construction Using DeepDive

Authors: De Sa Christopher; Ré Christopher; Shin Jaeho; Wang Feiran; Wu Sen; Zhang Ce
Publication date: 15/06/2015

Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

arXiv.org e-Print Archive

A Kernel Theory of Modern Data Augmentation

Authors: Dao Tri; De Sa Christopher; Gu Albert; Ratner Alexander J.; Ré Christopher; Smith Virginia
Publication date: 20/03/2019

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear naturally with respect to this model, even when we do not employ kernel classification. Next, we analyze more directly the effect of augmentation on kernel classifiers, showing that data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. These frameworks both serve to illustrate the ways in which data augmentation affects the downstream learning model, and the resulting analyses provide novel connections between prior work in invariant kernels, tangent propagation, and robust optimization. Finally, we provide several proof-of-concept applications showing that our theory can be useful for accelerating machine learning workflows, such as reducing the amount of computation needed to train using augmented data, and predicting the utility of a transformation prior to training.

arXiv.org e-Print Archive

Improving Neural Network Quantization without Retraining using Outlier Channel Splitting

Authors: De Sa Christopher; Dotzel Jordan; Hu Yuwei; Zhang Zhiru; Zhao Ritchie
Publication date: 22/05/2019

Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the less-studied topic of quantizing a floating-point model without (re)training. DNN weights and activations follow a bell-shaped distribution post-training, while practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior work has addressed this by clipping the outliers or using specialized hardware. In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the channel values. The network remains functionally identical, but affected outliers are moved toward the center of the distribution. OCS requires no additional training and works on commodity hardware. Experimental evaluation on ImageNet classification and language modeling shows that OCS can outperform state-of-the-art clipping techniques with only minor overhead. Comment: 10 pages; update to ICML camera-ready version

arXiv.org e-Print Archive
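
The splitting step described in the abstract above (duplicate a channel containing an outlier, then halve its values) can be illustrated on a single linear layer. This is a simplified sketch under stated assumptions, not the authors' implementation: it splits only the one input channel holding the largest-magnitude weight, and the function names are hypothetical.

```python
def linear(weights, inputs):
    """Dense layer without bias: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def split_outlier_channel(weights, inputs):
    """Duplicate the input channel with the largest |weight| and halve its weights."""
    num_channels = len(inputs)
    col = max(range(num_channels),
              key=lambda j: max(abs(row[j]) for row in weights))
    new_weights = []
    for row in weights:
        half = row[col] / 2.0
        new_row = list(row)
        new_row[col] = half
        new_row.append(half)  # the duplicated channel carries the other half
        new_weights.append(new_row)
    # The duplicated channel sees the same activation as the original.
    new_inputs = list(inputs) + [inputs[col]]
    return new_weights, new_inputs


if __name__ == "__main__":
    W = [[0.5, -8.0, 1.0], [0.25, 2.0, -1.5]]
    x = [1.0, 2.0, 3.0]
    W2, x2 = split_outlier_channel(W, x)
    print(linear(W, x), linear(W2, x2))
```

The layer output is unchanged while the largest weight magnitude is halved, which narrows the range a linear quantization grid has to cover at the cost of one extra channel.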